Scalable File Server with Highly Available Pairs



Background of the Invention 



1. Field of the Invention



The invention relates to storage systems. 



2. Related Art



Computer storage systems are used to record and retrieve data. One way storage systems
are characterized is by the amount of storage capacity they have. The capacity for storage
systems has increased greatly over time. One problem in the known art is the difficulty of
planning ahead for desired increases in storage capacity. A related problem in the known
art is the difficulty in providing scalable storage at a relatively efficient cost. This has
subjected customers to a dilemma: one can either purchase a file system with a single
large file server, or purchase a file system with a number of smaller file servers.

The single-server option has several drawbacks. (1) The customer must buy a larger file
system than currently desired, so as to have room available for future expansion. (2) The
entire file system can become unavailable if the file server fails for any reason. (3) The
file system, although initially larger, is not easily scalable if the customer comes to
desire a system that is larger than the originally planned capacity.

The multi-server option also has several drawbacks. In systems in which the individual
components of the multi-server device are tightly coordinated, (1) the same scalability
problem occurs for the coordinating capacity for the individual components. That is, the
customer must buy more coordinating capacity than currently desired, so as to have room
available for future expansion. (2) The individual components are themselves often
obsolete by the time the planned-for greater capacity is actually needed. (3) Tightly
coordinated systems are often very expensive relative to the amount of scalability desired.

In systems in which the individual components of the multi-server device are only loosely
coordinated, it is difficult to cause the individual components to behave in a coordinated
manner so as to emulate a single file server. Although failure of a single file server does
not cause the entire file system to become unavailable, it does cause any files stored on
that particular file server to become unavailable. If those files were critical to operation
of the system, or some subsystem thereof, the applicable system or subsystem will be
unavailable as a result. Administrative difficulties also generally increase due to the
larger number of smaller file servers.

Accordingly, it would be advantageous to provide a method and system for operating a file
server system that is scalable, that is, which can be increased in capacity without major
system alterations, and which is relatively cost efficient with regard to that scalability.
This advantage is achieved in an embodiment of the invention in which a plurality of file
server nodes (each a pair of file servers) are interconnected. Each file server node has a
pair of controllers for simultaneously controlling a set of storage elements such as disk
drives. File server commands are routed among file server nodes to the file server node
having control of the applicable storage elements, and each pair of file servers is reliable
due to redundancy.

It would also be advantageous to provide a storage system that is resistant to failures of
individual system elements, and that can continue to operate after any single point of
failure. This advantage is achieved in an embodiment of the invention like that described
in co-pending Application Serial No. 09/037,652, filed March 10, 1998, Express Mail
Mailing No. EE143637441US, in the name of the same inventor, titled "[...] Available File
Servers", attorney docket number NAP-012, hereby incorporated by reference as if fully
set forth herein.

Summary of the Invention

The invention provides a file server system and a method for operating that system, which
is easily scalable in number and type of individual components. A plurality of file server
nodes (each a pair of file servers) are coupled using inter-node connectivity, such as an
inter-node network, so that any one pair can be accessed from any other pair. Each file
server node includes a pair of file servers, each of which has a memory and each of which
conducts file server operations by simultaneously writing to its own memory and to that of
its twin, the pair being used to simultaneously control a set of storage elements such as
disk drives. File server commands or requests directed to particular mass storage elements
are routed among file server nodes using an inter-node switch and processed by the file
server nodes controlling those particular storage elements. Each file server node (that is,
each pair of file servers) is reliable due to its own redundancy.

In a preferred embodiment, the mass storage elements are disposed and controlled to form a
redundant array, such as a RAID storage system. The inter-node network and inter-node
switch are redundant, and file server commands or requests arriving at the network of pairs
are coupled using the network and the switch to the appropriate pair and processed at that
pair. Thus, each pair can be reached from each other pair, and no single point of failure
prevents access to any individual storage element. The file servers are disposed and
controlled to recognize failures of any single element in the file server system and to
provide access to all mass storage elements despite any such failures.

Brief Description of the Drawings

Figure 1 shows a block diagram of a scalable and highly available file server system.

Figure 2A shows a block diagram of a first interconnect system for the file server system.

Figure 2B shows a block diagram of a second interconnect system for the file server system.

Figure 3 shows a process flow diagram of operation of the file server system.
Detailed Description of the Preferred Embodiment

In the following description, a preferred embodiment of the invention is described with
regard to preferred process steps and data structures. However, those skilled in the art
would recognize, after perusal of this application, that embodiments of the invention may
be implemented using one or more general purpose processors (or special purpose processors
adapted to the particular process steps and data structures) operating under program
control, and that implementation of the preferred process steps and data structures
described herein using such equipment would not require undue experimentation or further
invention.

Inventions described herein can be used in conjunction with inventions described in the
following applications:

• Application Serial No. 09/037,652, filed March 10, 1998, Express Mail Mailing No.
  EE143637441US, in the name of the same inventor, titled "Scalable and Highly Available
  File Server", attorney docket number NAP-012.

This application is hereby incorporated by reference as if fully set forth herein. It is
herein referred to as the "Availability Disclosure."
File Server System

Figure 1 shows a block diagram of a scalable and highly available file server system.

A file server system 100 includes a set of file servers 110, each including a coupled pair
of file server nodes 111 having co-coupled common sets of mass storage devices 112. Each
node 111 is like the file server node further described in the Availability Disclosure.
Each node 111 is coupled to a common interconnect 120. Each node 111 is also coupled to a
first network switch 130 and a second network switch 130.

Each node 111 is coupled to the common interconnect 120, so as to be able to transmit
information between any two file servers 110. The common interconnect 120 includes a set
of communication links (not shown) which are redundant in the sense that even if any
single communication link fails, each node 111 can still be contacted by each other node
111.
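The redundancy property stated above (every node 111 remains reachable after any single communication link fails) can be checked mechanically for a candidate link layout. The following is a minimal sketch under stated assumptions: the node names, the link lists, and the functions `reachable` and `survives_any_single_link_failure` are hypothetical illustrations and are not part of the disclosure.

```python
from collections import deque


def reachable(nodes, links, start):
    """Return the set of nodes reachable from start over undirected links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def survives_any_single_link_failure(nodes, links):
    """True if all nodes stay mutually reachable after any one link is removed."""
    nodes = list(nodes)
    links = list(links)
    if not nodes:
        return True
    # The intact network must be connected to begin with.
    if reachable(nodes, links, nodes[0]) != set(nodes):
        return False
    # Remove each link in turn and re-check connectivity.
    for i in range(len(links)):
        remaining = links[:i] + links[i + 1:]
        if reachable(nodes, remaining, nodes[0]) != set(nodes):
            return False
    return True


# Hypothetical layouts: a closed ring of four nodes, and the same ring with one link missing.
node_names = ["n0", "n1", "n2", "n3"]
ring = [("n0", "n1"), ("n1", "n2"), ("n2", "n3"), ("n3", "n0")]
chain = ring[:-1]
ring_ok = survives_any_single_link_failure(node_names, ring)
chain_ok = survives_any_single_link_failure(node_names, chain)
```

In this sketch the closed ring tolerates the loss of any one link, while the open chain does not, because removing any of its links isolates at least one node.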

In a preferred embodiment, the common interconnect 120 includes a NUMA (non-uniform memory
access) interconnect, such as the SCI interconnect operating at 1 gigabyte per second or
the SCI-lite interconnect operating at 125 megabytes per second.
Each file server 110 is coupled to the first network switch 130, so as to receive and
respond to file server requests transmitted therefrom. In a preferred embodiment there is
also a second network switch 130, although the second network switch 130 is not required
for operation of the file server system 100. As with the first network switch 130, each
file server 110 is coupled to the second network switch 130, so as to receive and respond
to file server requests transmitted therefrom.

File Server System Operation

In operation of the file server system 100, as further described herein, a sequence of
file server requests arrives at the first network switch 130 or, if the second network
switch 130 is present, at either the first network switch 130 or the second network switch
130. Either network switch 130 routes each file server request in its sequence to the
particular file server 110 that is associated with the particular mass storage device
needed for processing the file server request.

One of the two nodes 111 at the designated file server 110 services the file server request
and makes a file server response. The file server response is routed by one of the network
switches 130 back to a source of the request.
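The routing decision made by a network switch 130 can be pictured as a lookup from mass storage device to the file server 110 associated with it. The following is a minimal sketch, assuming a static ownership table; the class names `FileServer` and `NetworkSwitch`, the request dictionary layout, and the device names are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class FileServer:
    """A file server 110: a pair of nodes 111 sharing a set of mass storage devices 112."""
    name: str
    devices: frozenset


@dataclass
class NetworkSwitch:
    """A network switch 130 holding an assumed device-to-file-server table."""
    servers: list
    device_owner: dict = field(init=False)

    def __post_init__(self):
        # Build the lookup table once from the devices each file server controls.
        self.device_owner = {}
        for server in self.servers:
            for device in server.devices:
                self.device_owner[device] = server

    def route(self, request):
        """Return the file server associated with the device the request needs."""
        device = request["device"]
        try:
            return self.device_owner[device]
        except KeyError:
            raise LookupError(f"no file server controls device {device!r}") from None


# Hypothetical configuration with two file servers.
servers = [
    FileServer("fs-a", frozenset({"disk0", "disk1"})),
    FileServer("fs-b", frozenset({"disk2"})),
]
switch = NetworkSwitch(servers)
target = switch.route({"device": "disk2", "op": "read"})
```

Here `target` is the file server named `fs-b`, since `disk2` appears only in its device set. A second switch built from the same server list would route identically, which is what lets either switch 130 accept a request.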
Interconnect System

Figure 2A shows a block diagram of a first interconnect system for the file server system.

In a first preferred embodiment, the interconnect 120 includes a plurality of nodes 111,
each of which is part of a file server 110. The nodes 111 are each disposed on a
communication ring 211. Messages are transmitted between adjacent nodes 111 on each ring
211.

In this first preferred embodiment, each ring 211 comprises an SCI (Scalable Coherent
Interface) network according to IEEE standard 1596-1992, or an SCI-lite network according
to IEEE standard 1394.1. Both IEEE standard 1596-1992 and IEEE standard 1394.1 support
remote memory access and DMA; the combination of these features is often called NUMA
(non-uniform memory access). SCI networks operate at a data transmission rate of about 1
gigabyte per second; SCI-lite networks operate at a data transmission rate of about 125
megabytes per second.

A communication switch 212 couples adjacent rings 211. The communication switch 212
receives and transmits messages on each ring 211, and operates to bridge messages from a
first ring 211 to a second ring 211. The communication switch 212 bridges those messages
that are transmitted on the first ring 211 and designated for transmission to the second
ring 211. A switch 212 can also be coupled directly to a file server node 110.

In this first preferred embodiment, each ring 211 has a single node 111, so as to prevent
any single point of failure (such as failure of the ring 211 or its switch 212) from
preventing communication to more than one node 111.

Figure 2B shows a block diagram of a second interconnect system for the 
file server system. 

In a second preferred embodiment, the interconnect 120 includes a plurality of nodes 111,
each of which is part of a file server 110. Each node 111 includes an associated network
interface element 114. In a preferred embodiment, the network interface element 114 for
each node 111 is like that described in the Availability Disclosure.

The network interface elements 114 are coupled using a plurality of com- 
munication links 221, each of which couples two network interface elements 114 and 
communicates messages therebetween. 

The network interface elements 114 have sufficient communication links 221 to form a
redundant communication network, so as to prevent any single point of failure (such as
failure of any one network interface element 114) from preventing communication to more
than one node 111.

In this second preferred embodiment, the network interface elements 114 are disposed with
the communication links 221 to form a logical torus, in which each network interface
element 114 is disposed on two logically orthogonal communication rings using the
communication links 221.

In this second preferred embodiment, each of the logically orthogonal communication rings
comprises an SCI network or an SCI-lite network, similar to the SCI network or SCI-lite
network described with reference to figure 2A.
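The logical torus of the second embodiment can be illustrated by placing the network interface elements 114 on a grid whose rows and columns each wrap around to form rings. The following is a sketch under stated assumptions: the grid dimensions, the `(row, col)` addressing, and the function names are hypothetical and chosen only to show that each element lies on exactly two orthogonal rings.

```python
def torus_rings(rows, cols):
    """Return (row_rings, col_rings) for a rows x cols logical torus.

    Each ring is a list of (row, col) positions in ring order; the last
    position is understood to link back to the first.
    """
    row_rings = [[(r, c) for c in range(cols)] for r in range(rows)]
    col_rings = [[(r, c) for r in range(rows)] for c in range(cols)]
    return row_rings, col_rings


def ring_neighbors(rows, cols, r, c):
    """Return the adjacent positions of (r, c) on each of its two rings."""
    return {
        "row_ring": [(r, (c - 1) % cols), (r, (c + 1) % cols)],
        "col_ring": [((r - 1) % rows, c), ((r + 1) % rows, c)],
    }


# Hypothetical 3 x 4 arrangement of network interface elements.
row_rings, col_rings = torus_rings(3, 4)
corner = ring_neighbors(3, 4, 0, 0)
```

For the element at position `(0, 0)`, the row ring wraps to `(0, 3)` and the column ring wraps to `(2, 0)`, so losing one ring still leaves the element reachable over the other.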

Operation Process Flow



Figure 3 shows a process flow diagram of operation of the file server system.

A method 300 is performed by the components of the file server system 100, and includes a
set of flow points and process steps as described herein.

At a flow point 310, a device coupled to the file server system 100 desires to make a file
system request.



Exp. Mail EK025321142US NAP-010 




At a step 311, the device transmits a file system request to a selected network switch 130
coupled to the file server system 100.

At a step 312, the network switch 130 receives the file system request. The network switch
130 determines which mass storage device the request applies to, and determines which file
server 110 is coupled to that mass storage device. The network switch 130 transmits the
request to that file server 110 (that is, to both of its nodes 111 in parallel), using the
interconnect 120.



At a step 313, the file server 110 receives the file system request. Each node 111 at the
file server 110 queues the request for processing.

At a step 314, one of the two nodes 111 at the file server 110 processes the file system
request and responds thereto. The other one of the two nodes 111 at the file server 110
discards the request without further processing.

At a flow point 320, the file system request has been successfully processed.
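Steps 313 and 314 can be sketched as a pair of queues: both nodes 111 enqueue every request delivered to the file server 110, one node responds, and the other drops its copy. This is an illustration only. The classes `Node` and `FileServerPair`, the string form of a request, and especially the `responder_index` argument are assumptions; the description does not say how the responding node is chosen.

```python
from collections import deque


class Node:
    """One node 111 of a pair; it queues every request delivered to the pair."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.processed = []
        self.discarded = []


class FileServerPair:
    """A file server 110 modeled as two nodes that receive requests in parallel."""

    def __init__(self, name):
        self.nodes = (Node(name + "-a"), Node(name + "-b"))

    def receive(self, request):
        """Step 313: each node queues the request for processing."""
        for node in self.nodes:
            node.queue.append(request)

    def service(self, responder_index=0):
        """Step 314: one node processes the oldest request; the other discards it.

        The choice of responder is an assumption of this sketch.
        """
        responder = self.nodes[responder_index]
        other = self.nodes[1 - responder_index]
        request = responder.queue.popleft()
        # Both queues were filled in the same order, so the heads match.
        other.discarded.append(other.queue.popleft())
        responder.processed.append(request)
        return f"{responder.name} handled {request}"


pair = FileServerPair("fs")
pair.receive("read disk0")
response = pair.service()
```

Because the non-responding node still holds the request until it is discarded, it has what it needs to take over if its twin fails before responding.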
If any single point of failure occurs between the requesting device and the mass storage
device to which the file system request applies, the file server system 100 is still able
to process the request and respond to the requesting device.

• If either one of the network switches 130 fails, the other network switch 130 is able to
  receive the file system request and transmit it to the appropriate file server 110.

• If any link in the interconnect 120 fails, the remaining links in the interconnect 120
  are able to transmit the message to the appropriate file server 110.

• If either node 111 at the file server 110 fails, the other node 111 is able to process
  the file system request using the appropriate mass storage device. Because nodes 111 at
  each file server 110 are coupled in pairs, each file server 110 is highly available.
  Because file servers 110 are coupled together for managing collections of mass storage
  devices, the entire system 100 is scalable by addition of file servers 110. Thus, each
  cluster of file servers 110 is scalable by addition of file servers 110.

• If any one of the mass storage devices (other than the actual target of the file system
  request) fails, there is no effect on the ability of the other mass storage devices to
  respond to processing of the request, and there is no effect on either of the two nodes
  111 which process requests for that mass storage device.
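The failure cases listed above can be exercised by failing one component at a time and checking that a working switch and a working node remain. The following sketch makes simplifying assumptions: it models only the network switches 130 and the two nodes 111 of one file server 110, it ignores the interconnect links and mass storage devices, and the function names are hypothetical.

```python
def pick_path(switches_up, nodes_up):
    """Return (switch_index, node_index) of the first working switch and node, or None."""
    switch = next((i for i, up in enumerate(switches_up) if up), None)
    node = next((i for i, up in enumerate(nodes_up) if up), None)
    if switch is None or node is None:
        return None
    return switch, node


def tolerates_any_single_failure(num_switches=2, num_nodes=2):
    """True if a request can still be served after any one switch or node fails."""
    components = [("switch", i) for i in range(num_switches)]
    components += [("node", i) for i in range(num_nodes)]
    for kind, index in components:
        # Mark exactly one component as failed and everything else as working.
        switches_up = [not (kind == "switch" and i == index) for i in range(num_switches)]
        nodes_up = [not (kind == "node" and i == index) for i in range(num_nodes)]
        if pick_path(switches_up, nodes_up) is None:
            return False
    return True


with_two_switches = tolerates_any_single_failure()
with_one_switch = tolerates_any_single_failure(num_switches=1)
```

Under these assumptions the two-switch arrangement survives every single failure, while the single-switch arrangement does not, which is consistent with the second network switch 130 being optional for operation but present in the preferred embodiment.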

Alternative Embodiments

Although preferred embodiments are disclosed herein, many variations are possible which
remain within the concept, scope, and spirit of the invention, and these variations would
become clear to those skilled in the art after perusal of this application.
Claims

1. A file server system including
a plurality of file server nodes;
at least one inter-node connectivity element coupled to said plurality of nodes;
at least one switch coupled to said plurality of nodes and disposed for coupling file
server commands to ones thereof;
said nodes including a set of pairs, each said pair being coupled to a set of storage
elements and being disposed to control said storage elements in response to said file
server commands.

2. A system as in claim 1, wherein at least some of said pairs are disposed for failover
from a first node to a second node.

3. A system as in claim 1, wherein each said node includes a processor and a memory.

4. A system as in claim 1, wherein
each said storage element corresponds to one said pair;
each said storage element is coupled to both nodes in said corresponding pair;
whereby both nodes in said corresponding pair are equally capable of controlling said
storage element.

5. A system as in claim 1, wherein said connectivity element includes a NUMA network.

6. A system as in claim 1, wherein said file server system is scalable by addition of a
set of pairs of said nodes.

7. A system as in claim 1, wherein said set of storage elements coupled to at least one
said pair includes a RAID storage system.

8. A system as in claim 1, wherein
each pair includes a first node and a second node;
each pair is disposed to receive file server commands directed to either said first node
or to said second node;
each pair is disposed when said file server commands are directed to said first node to
execute said file server commands at said first node and to store a copy of said file
server commands at said second node; and
each pair is disposed when said file server commands are directed to said second node to
execute said file server commands at said second node and to store a copy of said file
server commands at said first node.

9. A system as in claim 8, wherein
each said pair is disposed when said file server commands are directed to said first node
and said first node is inoperable to execute said file server commands at said second
node; and
each pair is disposed when said file server commands are directed to said second node and
said second node is inoperable to execute said file server commands at said first node.

10. A system as in claim 1, wherein
each pair is disposed to receive a file server command;
each pair is disposed so that a first node responds to said file server command while a
second node records said file server command; and
each pair is disposed to failover from said first node to said second node.

11. A system as in claim 10, wherein
each pair is disposed to receive a second file server command;
each pair is disposed so that said second node responds to said second file server command
while said first node records said file server command; and
each pair is disposed to failover from said first node to said second node.
12. A system as in claim 10, wherein said first node controls said storage elements in
response to said file server command while said second node is coupled to said storage
elements and does not control said storage elements in response to said file server
command.

13. A method of operating a file server system, said method including steps for
operating a plurality of file server nodes in a set of pairs, each said pair being
responsive to a set of file server commands;
coupling said file server commands to said pairs;
coupling a set of messages between ones of said nodes in a first said pair and ones of
said nodes in a second said pair.

14. A method as in claim 13, including steps for failover from a first node to a second
node, and from said second node to said first node, in each said pair.

15. A method as in claim 13, including steps for scaling said file server by addition of a
set of pairs of said nodes.

16. A method as in claim 13, including steps for controlling a set of storage elements
corresponding to one said pair from either node in said pair.


17. A method as in claim 16, including steps for operating said set of storage elements
according to a RAID storage method.

18. A method as in claim 13, including steps for
receiving file server commands directed to either a first node or to a second node in each
said pair;
when said file server commands are directed to said first node, responding to said file
server commands at said first node and storing a copy of said file server commands at said
second node; and
when said file server commands are directed to said second node, responding to said file
server commands at said second node and storing a copy of said file server commands at
said first node.



19. A method as in claim 18, including steps for
when said file server commands are directed to said first node and said first node is
inoperable, responding to said file server commands at said second node using said copy at
said second node; and
when said file server commands are directed to said second node and said second node is
inoperable, responding to said file server commands at said first node using said copy at
said first node.

20. A method as in claim 13, including steps for
receiving a file server command at one said pair;
responding to said file server command at a first node while recording said file server
command at a second node; and
failing over from said first node to said second node.

21. A method as in claim 20, including steps for
receiving a second file server command at said one pair;
responding to said file server command at said second node while recording said file
server command at said first node; and
failing over from said first node to said second node.

22. A method as in claim 20, including steps for controlling said storage elements in
response to said file server command by said first node while said second node is coupled
to said storage elements and does not control said storage elements in response to said
file server command.
Abstract of the Disclosure

The invention provides a file server system and a method for operating that system, which
is easily scalable in number and type of individual components. A plurality of file
servers are coupled using inter-node connectivity, such as an inter-node network, so that
any one node can be accessed from any other node. Each file server includes a pair of file
server nodes, each of which has a memory and each of which conducts file server operations
by simultaneously writing to its own memory and to that of its twin, the pair being used to
simultaneously control a set of storage elements such as disk drives. File server requests
directed to particular mass storage elements are routed among file servers using an
inter-node switch and processed by the file servers controlling those particular storage
elements. The mass storage elements are disposed and controlled to form a redundant array,
such as a RAID storage system. The inter-node network and inter-node switch are redundant,
so that no single point of failure prevents access to any individual storage element. The
file servers are disposed and controlled to recognize failures of any single element in the
file server system and to provide access to all mass storage elements despite any such
failures.